fundamental frequency
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
- Media > Music (0.68)
- Leisure & Entertainment (0.68)
Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR
Leang, Sotheara, Castelli, Éric, Vaufreydaz, Dominique, Sam, Sethserey
The dynamic characteristics of the speech signal provide temporal information and play an important role in enhancing Automatic Speech Recognition (ASR). In this work, we characterized the acoustic transitions in a ratio plane of Spectral Subband Centroid Frequencies (SSCFs) using polar parameters to capture the dynamic characteristics of the speech and minimize spectral variation. These dynamic parameters were combined with Mel-Frequency Cepstral Coefficients (MFCCs) in Vietnamese ASR to capture more detailed spectral information. The SSCF0 was used as a pseudo-feature for the fundamental frequency (F0) to describe the tonal information robustly. The findings showed that the proposed parameters significantly reduce word error rates and exhibit greater gender independence than the baseline MFCCs.
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
- North America > United States > New Jersey (0.04)
- Asia > Vietnam (0.04)
- Asia > Cambodia > Phnom Penh Province > Phnom Penh (0.04)
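The subband centroid idea behind SSCFs in the abstract above can be illustrated with a minimal sketch. It assumes the common definition of a spectral subband centroid as the power-weighted mean frequency inside a band; the band edges, frame length, and function names are hypothetical choices for illustration, and the polar-parameter step described in the abstract is not covered.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def subband_centroids(frame, sample_rate, band_edges_hz):
    """Power-weighted mean frequency within each subband [lo, hi)."""
    n = len(frame)
    power = power_spectrum(frame)
    freqs = [k * sample_rate / n for k in range(len(power))]
    centroids = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = [(f, p) for f, p in zip(freqs, power) if lo <= f < hi]
        total = sum(p for _, p in band)
        if total > 0:
            centroids.append(sum(f * p for f, p in band) / total)
        else:
            # Assumption: fall back to the band midpoint when the band is empty.
            centroids.append((lo + hi) / 2)
    return centroids


# Hypothetical test signal: a pure 500 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(256)]
centroids = subband_centroids(frame, sample_rate, [0, 1000, 2000, 4000])
```

For a pure tone, the centroid of the band containing the tone sits at the tone frequency, which is why a low-band centroid can act as a rough stand-in for F0.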
Real-Time Pitch/F0 Detection Using Spectrogram Images and Convolutional Neural Networks
Pitch (also called F0 or fundamental frequency) is a very important voice feature for smart mobility features, such as driver's emotion detection, vehicle personalized profiles, and secured speaker identification. This paper presents a novel approach to detect F0 through Convolutional Neural Networks (CNN) and image processing techniques to directly estimate pitch from spectrogram images. Our new approach demonstrates a very good detection accuracy; a total of 92% of predicted pitch contours have strong or moderate correlations to the true pitch contours. Furthermore, the experimental comparison between our new approach and other state-of-the-art CNN methods reveals that our approach can enhance the detection rate by approximately 5% across various Signal-to-Noise Ratio (SNR) conditions. Pitch detection is very widely used for smart mobility features. For example, as shown in Fig. 1, pitch contour can be used to train a deep learning neural network for driver's emotion detection, which can alert to road rage.
- North America > United States > Michigan > Macomb County > Warren (0.05)
- North America > United States > New York (0.04)
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
- Transportation > Ground > Road (0.48)
- Automobiles & Trucks (0.47)
Audio-to-Image Encoding for Improved Voice Characteristic Detection Using Deep Convolutional Neural Networks
This paper introduces a novel audio-to-image encoding framework that integrates multiple dimensions of voice characteristics into a single RGB image for speaker recognition. In this method, the green channel encodes raw audio data, the red channel embeds statistical descriptors of the voice signal (including key metrics such as median and mean values for fundamental frequency, spectral centroid, bandwidth, rolloff, zero-crossing rate, MFCCs, RMS energy, spectral flatness, spectral contrast, chroma, and harmonic-to-noise ratio), and the blue channel comprises subframes representing these features in a spatially organized format. A deep convolutional neural network trained on these composite images achieves 98% accuracy in speaker classification across two speakers, suggesting that this integrated multi-channel representation can provide a more discriminative input for voice recognition tasks.
Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting
Nochumsohn, Liran, Moshkovitz, Michal, Avner, Orly, Di Castro, Dotan, Azencot, Omri
Time series forecasting is critical in numerous real-world applications, requiring accurate predictions of future values based on observed patterns. While traditional forecasting techniques work well in in-domain scenarios with ample data, they struggle when data is scarce or not available at all, motivating the emergence of zero-shot and few-shot learning settings. Recent advancements often leverage large-scale foundation models for such tasks, but these methods require extensive data and compute resources, and their performance may be hindered by ineffective learning from the available training set. This raises a fundamental question: What factors influence effective learning from data in time series forecasting? Toward addressing this, we propose using Fourier analysis to investigate how models learn from synthetic and real-world time series data. Our findings reveal that forecasters commonly suffer from poor learning from data with multiple frequencies and poor generalization to unseen frequencies, which impedes their predictive performance. To alleviate these issues, we present a novel synthetic data generation framework, designed to enhance real data or replace it completely by creating task-specific frequency information, requiring only the sampling rate of the target data. Our approach, Freq-Synth, improves the robustness of both foundation and non-foundation forecast models in zero-shot and few-shot settings, facilitating more reliable time series forecasting under limited data scenarios. Time series forecasting (TSF) plays a critical role in various areas, such as finance, healthcare, and energy, where accurate predictions of future values are essential for decision-making and planning. Traditionally, in-domain learning has been the common setting for developing forecasting models, where a model is trained using data from the same domain it will later be deployed in (Salinas et al., 2020; Zhou et al., 2021).
This ensures that the model captures the patterns, seasonality, and trends specific to the target domain, improving its predictive performance. However, a significant challenge arises when there is scarce or no historical information available for training, limiting the ability to apply traditional in-domain learning approaches (Sarmas et al., 2022; Fong et al., 2020). In such cases, the emergence of zero-shot (ZS) and few-shot (FS) learning settings offers potential solutions. Zero-shot learning enables models to generalize to new, unseen domains without requiring domain-specific data by leveraging knowledge transfer from other domains or tasks.
- South America > Argentina > Patagonia > Río Negro Province > Viedma (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Southern District > Beer-Sheva (0.04)
- Asia > Middle East > Israel > Haifa District > Haifa (0.04)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.92)
- Health & Medicine > Therapeutic Area > Immunology (0.92)
- Health & Medicine > Epidemiology (0.67)
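The link between a sampling rate and the frequency content of a synthetic series, which the Freq-Synth abstract above relies on, can be sketched as follows. This is not the Freq-Synth procedure itself; it is a minimal illustration, under assumed names and parameters, of generating a multi-frequency series and recovering its frequencies with a discrete Fourier transform.

```python
import cmath
import math
import random


def synth_series(freqs_hz, sample_rate, n_samples, seed=0):
    """Sum of sinusoids with random amplitudes and phases (hypothetical generator)."""
    rng = random.Random(seed)
    components = [(f, rng.uniform(0.5, 1.0), rng.uniform(0.0, 2 * math.pi))
                  for f in freqs_hz]
    return [sum(a * math.sin(2 * math.pi * f * t / sample_rate + ph)
                for f, a, ph in components)
            for t in range(n_samples)]


def dominant_frequencies(series, sample_rate, top_k):
    """Return the top_k frequencies (Hz) by DFT magnitude, sorted ascending."""
    n = len(series)
    mags = []
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(series))
        mags.append((abs(acc), k * sample_rate / n))
    mags.sort(reverse=True)
    return sorted(f for _, f in mags[:top_k])


# Frequencies must stay below the Nyquist limit (sample_rate / 2).
series = synth_series([2.0, 5.0], sample_rate=32, n_samples=128)
peaks = dominant_frequencies(series, sample_rate=32, top_k=2)
```

With 128 samples at 32 Hz the frequency resolution is 0.25 Hz, so both chosen frequencies fall exactly on DFT bins and are recovered without leakage.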
Reproducible Machine Learning-based Voice Pathology Detection: Introducing the Pitch Difference Feature
Vrba, Jan, Steinbach, Jakub, Jirsa, Tomáš, Verde, Laura, De Fazio, Roberta, Homma, Noriyasu, Zeng, Yuwen, Ichiji, Key, Hájek, Lukáš, Sedláková, Zuzana, Mareš, Jan
In this study, we propose a robust set of features derived from a thorough review of contemporary practices in voice pathology detection. The feature set is based on the combination of acoustic handcrafted features. Additionally, we introduce pitch difference as a novel feature. We combine this feature set, containing data from the publicly available Saarbrücken Voice Database (SVD), with preprocessing using the K-Means Synthetic Minority Over-Sampling Technique algorithm to address class imbalance. Moreover, we applied multiple ML models as binary classifiers. We utilized support vector machine, k-nearest neighbors, naive Bayes, decision tree, random forest and AdaBoost classifiers. To determine the best classification approach, we performed grid search on feasible hyperparameters of respective classifiers and subsections of features. Our approach has achieved state-of-the-art performance, measured by unweighted average recall, in voice pathology detection on the SVD database. We intentionally omit accuracy as it is a highly biased metric in the case of unbalanced data compared to the aforementioned metrics. The results are further enhanced by eliminating the potential overestimation of the results with repeated stratified cross-validation. This advancement demonstrates significant potential for the clinical deployment of ML methods, offering a valuable tool for an objective examination of voice pathologies. To support our claims, we provide a publicly available GitHub repository with DOI 10.5281/zenodo.13771573. Finally, we provide a REFORMS checklist.
- Europe > Germany > Saarland > Saarbrücken (0.14)
- Europe > Czechia > Prague (0.04)
- North America > United States > Massachusetts (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.54)
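The abstract above prefers unweighted average recall (UAR) over accuracy on unbalanced data. A minimal sketch of why, using the standard definition of UAR as the mean of per-class recalls; the toy labels below are invented for illustration and are not taken from the study.

```python
from collections import Counter


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with every class weighted equally."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical unbalanced set: 90 samples of class 0, 10 of class 1,
# scored against a degenerate classifier that always predicts class 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
```

Here accuracy is 0.9 even though the minority class is never detected, while UAR is 0.5, the chance level for two classes.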
Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation
Lee, Jin Woo, Park, Jaehyun, Choi, Min Jun, Lee, Kyogu
While significant advancements have been made in music generation and differentiable sound synthesis within machine learning and computer audition, the simulation of instrument vibration guided by physical laws has been underexplored. To address this gap, we introduce a novel model for simulating the spatio-temporal motion of nonlinear strings, integrating modal synthesis and spectral modeling within a neural network framework. Our model leverages physical properties and fundamental frequencies as inputs, outputting string states across time and space that solve the partial differential equation characterizing the nonlinear string. Empirical evaluations demonstrate that the proposed architecture achieves superior accuracy in string motion simulation compared to existing baseline architectures. The code and demo are available online.
- Asia > South Korea > Seoul > Seoul (0.05)
- North America > United States > Illinois (0.04)
- Europe > Ireland (0.04)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
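Modal synthesis, mentioned in the string-simulation abstract above, can be sketched for the textbook case of an ideal linear string fixed at both ends, where the displacement is a sum of sinusoidal mode shapes oscillating at integer multiples of the fundamental frequency. The paper targets nonlinear strings with a neural model, so this is only a simplified illustration; the per-mode decay law and all parameter names are assumptions.

```python
import math


def string_displacement(x, t, f0, amplitudes, length=1.0, decay=1.0):
    """Displacement of an ideal fixed-fixed string at position x and time t.

    Mode n has shape sin(n*pi*x/length) and oscillates at n*f0.
    Assumption: each mode decays as exp(-decay * n * t).
    """
    total = 0.0
    for n, amp in enumerate(amplitudes, start=1):
        shape = math.sin(n * math.pi * x / length)
        oscillation = math.cos(2 * math.pi * n * f0 * t)
        total += amp * shape * oscillation * math.exp(-decay * n * t)
    return total


f0 = 110.0                      # fundamental frequency in Hz (illustrative)
amps = [1.0, 0.5, 0.25]         # hypothetical modal amplitudes

midpoint_start = string_displacement(0.5, 0.0, f0, amps)
midpoint_later = string_displacement(0.5, 1.0 / f0, f0, amps)
endpoint = string_displacement(0.0, 0.0, f0, amps)
```

At the midpoint the even modes vanish, so the initial displacement is 1.0 - 0.25 = 0.75, and after one fundamental period it is slightly smaller because of the damping term.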
CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models
Li, Xiang, Bu, Fan, Mehrish, Ambuj, Li, Yingting, Han, Jiale, Cheng, Bo, Poria, Soujanya
Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up inference by approximating denoising distributions, but this introduces issues with model convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We further design weighted samplers to incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS's superiority over existing single-step speech synthesis systems, representing a significant advancement in the field.
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
- Media > Music (0.46)
- Leisure & Entertainment (0.46)
Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and Spectral Optimal Transport
Torres, Bernardo, Peeters, Geoffroy, Richard, Gaël
In neural audio signal processing, pitch conditioning has been used to enhance the performance of synthesizers. However, jointly training pitch estimators and synthesizers is a challenge when using standard audio-to-audio reconstruction loss, leading to reliance on external pitch trackers. To address this issue, we propose using a spectral loss function inspired by optimal transportation theory that minimizes the displacement of spectral energy. We validate this approach through an unsupervised autoencoding task that fits a harmonic template to harmonic signals. We jointly estimate the fundamental frequency and amplitudes of harmonics using a lightweight encoder and reconstruct the signals using a differentiable harmonic synthesizer. The proposed approach offers a promising direction for improving unsupervised parameter estimation in neural audio applications.
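
The motivation for a transport-based spectral loss can be sketched on idealized line spectra. The example below is not the loss proposed in the abstract above; it assumes a plain one-dimensional Wasserstein-1 distance computed from cumulative sums of normalized spectra, and compares it with a pointwise squared error. The helper names and bin indices are hypothetical.

```python
def harmonic_spectrum(f0_bin, amplitudes, n_bins):
    """Idealized line spectrum with harmonic h placed at bin h * f0_bin."""
    spectrum = [0.0] * n_bins
    for h, amp in enumerate(amplitudes, start=1):
        k = h * f0_bin
        if k < n_bins:
            spectrum[k] += amp
    return spectrum


def wasserstein_1d(spec_a, spec_b):
    """W1 distance between two normalized spectra via their CDF difference."""
    total_a, total_b = sum(spec_a), sum(spec_b)
    cdf_a = cdf_b = 0.0
    distance = 0.0
    for a, b in zip(spec_a, spec_b):
        cdf_a += a / total_a
        cdf_b += b / total_b
        distance += abs(cdf_a - cdf_b)
    return distance


def squared_error(spec_a, spec_b):
    """Pointwise squared error between two spectra."""
    return sum((a - b) ** 2 for a, b in zip(spec_a, spec_b))


amps = [1.0, 0.5]
target = harmonic_spectrum(10, amps, 64)
near = harmonic_spectrum(11, amps, 64)   # small pitch error
far = harmonic_spectrum(15, amps, 64)    # large pitch error

w_near, w_far = wasserstein_1d(target, near), wasserstein_1d(target, far)
l2_near, l2_far = squared_error(target, near), squared_error(target, far)
```

The squared error is identical for the near and far estimates because the spectral peaks never overlap, whereas the transport distance grows with the pitch error, which gives a pitch estimator a useful direction to move in.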